Abstract

This paper presents an efficient algorithm 
for the incremental construction of a mini- 
mal acyclic sequential transducer (ST) for a 
dictionary consisting of a list of input and 
output strings. The algorithm generalises 
a known method of constructing minimal 
finite-state automata (Daciuk et al., 2000).
Unlike the algorithm published by Mihov
and Maurel (2001), it does not require the
input strings to be sorted. The new method 
is illustrated by an application to pronun- 
ciation dictionaries. 

1 Introduction 

Sequential transducers constitute a powerful 
formalism for storing and processing large dic- 
tionaries. Each word in the dictionary is asso- 
ciated with an annotation, e.g., phonetic tran- 
scription or a collection of syntactic features. 
Since STs are deterministic, lexical lookup can 
be performed in linear time. Space efficiency 
can be achieved by means of minimisation algorithms (Mohri, 1994; Eisner, 2003).

In the present paper, we consider the following problem. Given a list of strings \(w^{(1)} \ldots w^{(m)}\) associated with annotations \(o^{(1)} \ldots o^{(m)}\), we want to construct a minimal ST \(T\) implementing the mapping \(f(w^{(j)}) = o^{(j)}\), \(j = 1 \ldots m\).

The naive way of doing that would be first to create a (non-minimal) ST implementing \(f\) and then to minimise it. As pointed out by Daciuk et al. (2000), this can be inefficient, especially for large \(m\). Instead, the same task can be performed more efficiently in an incremental way, i.e., by constructing a sequence of transducers \(T_1 \ldots T_m\) such that each \(T_j\) is the minimal ST implementing the restriction of the original mapping \(f\) to the first \(j\) words (\(f|_{\{w^{(1)} \ldots w^{(j)}\}}\)). Since the insertion of a new word \(w^{(j+1)}\) typically affects only a few states of the transducer, \(T_{j+1}\) can be constructed from \(T_j\) by changing only a small part of its structure.



Daciuk et al. (2000) show how to incrementally construct a minimal finite-state automaton for a list of words \(w^{(1)} \ldots w^{(m)}\). Their algorithm can also be applied to transducers, but fails to produce a minimal ST in the general case. Mihov and Maurel (2001) describe an algorithm that handles the ST case correctly, but requires the words to be sorted in advance. In some applications, this requirement is unrealistic as lexical entries may be added dynamically to an already constructed dictionary.¹ The present paper describes an algorithm that makes no assumptions about the order of the list \(w^{(1)} \ldots w^{(m)}\).

The paper is structured as follows. Section 2 introduces definitions and notation. The original algorithm for finite automata is described in section 3. Section 4 explains why the algorithm does not work for transducers. The required generalisation is introduced in section 5; an algorithm based on this generalisation is the topic of section 6. Section 7 illustrates the new method with a practical application.

2 Definitions 

Definition 1. (Deterministic FSA) A deterministic finite-state automaton (DFSA) over an alphabet \(\Sigma\) is a quintuple \(A = (\Sigma, Q, q_0, \delta, F)\) such that \(Q\) is a finite set of states, \(q_0 \in Q\) is the initial state, \(F \subseteq Q\) the set of final states, and \(\delta : Q \times \Sigma \to Q\) is a (partial) transition function. \(\delta\) can be extended to the domain \(Q \times \Sigma^*\) by the following definition: \(\delta^*(q, \epsilon) = q\), \(\delta^*(q, wa) = \delta(\delta^*(q, w), a)\).

\(A\) accepts a string \(w \in \Sigma^*\) if \(q = \delta^*(q_0, w)\) is defined and \(q \in F\). The set of all strings accepted by \(A\) is called the language of \(A\) and written \(L(A)\). An FSA is called trim if every state belongs to a path from \(q_0\) to a final state.

1 Many systems employ a built-in lexicon that is constructed off-line, but may be extended with user dictionaries merged in at any time in any order. The only way to use the sorted-data algorithm would be to unfold the already minimised lexicon, add the new entries, sort the data and re-apply the construction method.

Definition 2. (Right Language) Let \(A = (\Sigma, Q, q_0, \delta, F)\) be a DFSA. The right language \(L_A(q)\) of a state \(q\) in \(A\) is the set of all strings \(w\) such that \(\delta^*(q, w) \in F\). If \(\delta^*(q_0, u)\) is defined for \(u \in \Sigma^*\), the right language of \(u\) is defined as \(L_A(u) = L_A(\delta^*(q_0, u))\). We omit the subscript whenever \(A\) can be inferred from the context.

Definition 3. (Sequential Transducers) A sequential transducer (ST) over an input alphabet \(\Sigma\) and an output alphabet \(\Delta\) is a 7-tuple \(T = (\Sigma, \Delta, Q, q_0, \delta, \sigma, F)\) such that \(A = (\Sigma, Q, q_0, \delta, F)\) is a DFSA and \(\sigma : Q \times \Sigma \to \Delta^*\) is a function that labels transitions with emissions from \(\Delta^*\) (\(Dom(\sigma) = Dom(\delta)\)).

Function \(\sigma\) can be extended to \(Q \times \Sigma^*\) according to the following recursive definition: \(\sigma^*(q, \epsilon) = \epsilon\), \(\sigma^*(q, wa) = \sigma^*(q, w)\,\sigma(\delta^*(q, w), a)\).

Unless indicated otherwise, the definitions formulated above for DFSAs also apply to STs (\(L(T)\), \(L(u)\), \(L(q)\)).

Each ST \(T\) realises a function \(f_T : \Sigma^* \to \Delta^*\) such that \(Dom(f_T) = L(T)\) and \(f_T(u) = \sigma^*(q_0, u)\) for \(u \in Dom(f_T)\). A sequential function is one that can be realised by an ST.
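As an illustration of Definition 3, the following is a minimal sketch of how \(f_T(u) = \sigma^*(q_0, u)\) might be evaluated in linear time. The table layout (`Transitions`), the function name `transduce` and the toy transducer `TOY` are assumptions made for this example and do not come from the paper.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical encoding (not from the paper): delta and sigma share one table
# mapping (state, input symbol) -> (next state, emitted string).
Transitions = Dict[Tuple[int, str], Tuple[int, str]]


def transduce(trans: Transitions, finals: Set[int], q0: int, word: str) -> Optional[str]:
    """Return f_T(word) = sigma*(q0, word), or None if word is not in L(T)."""
    state, emitted = q0, []
    for symbol in word:
        step = trans.get((state, symbol))
        if step is None:          # delta is partial: no transition, word rejected
            return None
        state, output = step
        emitted.append(output)
    # The output only counts if the walk ends in a final state.
    return "".join(emitted) if state in finals else None


# Toy ST over single-character symbols realising {ab -> x, ac -> y}.
TOY = {(0, "a"): (1, ""), (1, "b"): (2, "x"), (1, "c"): (2, "y")}
print(transduce(TOY, {2}, 0, "ab"))  # prints x
```

Each input symbol triggers exactly one table lookup, which is the sense in which lookup in a deterministic transducer is linear in the length of the word.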

Definition 4. (Minimal FSA/ST) A DFSA \(A = (\Sigma, Q, q_0, \delta, F)\) is called minimal if \(|Q| \leq |Q'|\) for all DFSA \(A' = (\Sigma, Q', q'_0, \delta', F')\) such that \(L(A') = L(A)\).

An ST \(T = (\Sigma, \Delta, Q, q_0, \delta, \sigma, F)\) is called minimal if \(|Q| \leq |Q'|\) for any ST \(T' = (\Sigma, \Delta, Q', q'_0, \delta', \sigma', F')\) such that \(f_{T'} = f_T\).

2.1 String Notation 

For \(u, v \in \Sigma^*\), \(uv\) and \(u \cdot v\) denote the concatenation of \(u\) and \(v\). \(u \wedge v\) is the longest common prefix of \(u\) and \(v\). If \(u\) is a prefix of \(v\) (written \(u \leq_p v\)), \(u^{-1}v\) denotes the remainder of \(v\): \(u \cdot u^{-1}v = v\). \(|u|\) denotes the length of \(u\), while \(u_1 \ldots u_{|u|}\) are its letters. \(u_{[j \ldots k]}\) denotes the substring \(u_j \ldots u_k\) of \(u\) (\(u_{[j \ldots k]} = \epsilon\) if \(j > k\)). If \(T\) is an ST, \(w \wedge T\) stands for the longest prefix \(u \leq_p w\) such that \(\delta^*(q_0, u)\) is defined.
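The string operations above translate directly into code. The sketch below assumes that strings over the input or output alphabet are represented as Python strings; the helper names `lcp`, `remainder` and `substring` are chosen here for illustration only.

```python
def lcp(u: str, v: str) -> str:
    """u ∧ v: the longest common prefix of u and v."""
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]


def remainder(u: str, v: str) -> str:
    """u^{-1} v: what is left of v after removing its prefix u."""
    if not v.startswith(u):
        raise ValueError("u is not a prefix of v")
    return v[len(u):]


def substring(u: str, j: int, k: int) -> str:
    """u_[j...k] with 1-based, inclusive indices; the empty string if j > k."""
    return u[j - 1:k] if j <= k else ""
```

For instance, `lcp("axxx", "aby")` is `"a"` and `remainder("a", "axxx")` is `"xxx"`, which is the kind of prefix/remainder split the insertion step of section 6.1 relies on.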

3 Minimisation of FSAs 

The algorithm described by Daciuk et al. (2000) is iterative. In each iteration, given a minimal acyclic trim FSA \(A = (\Sigma, Q, q_0, \delta, F)\) and a word \(w \in \Sigma^*\), the algorithm creates a DFSA \(A' = (\Sigma, Q', q_0, \delta', F')\) such that \(A'\) is the minimal automaton for the language \(L(A) \cup \{w\}\). \(A'\) then serves as input to the next iteration.



The key notion here is the equivalence of states. Two states \(q_1, q_2 \in Q\) are considered equivalent (written \(q_1 \equiv q_2\)) iff \(L(q_1) = L(q_2)\). A well-known result states that a trim FSA is minimal iff it does not contain a pair \(q_1, q_2\) of distinct but equivalent states:

\[ \forall q_1, q_2 \in Q : q_1 \neq q_2 \Rightarrow q_1 \not\equiv q_2 \qquad (1) \]

Each iteration consists of two steps, which can be called insertion and local minimisation.²

3.1 Insertion 

The insertion operation identifies the longest prefix \(w_1 \ldots w_l\) of \(w\) in \(A\) and the corresponding states \(q_0 \ldots q_l\). Some of them may be confluence states, i.e., states \(q_i\) such that in-degree\((q_i) > 1\). In order to prevent overgeneration, the algorithm identifies the first confluence state \(q_k\) and clones the path \(q_k \ldots q_l\). The cloned states \(\tilde{q}_k \ldots \tilde{q}_l\) are copies of the original ones, i.e., \(\tilde{\delta}(\tilde{q}_i, a) = \delta(q_i, a)\) for all \(a \in \Sigma\) such that \(\delta(q_i, a)\) is defined, with the only exception of the transition consuming the next symbol of \(w\): \(\tilde{\delta}(\tilde{q}_i, w_{i+1}) = \tilde{q}_{i+1}\). Furthermore, \(\tilde{\delta}(q_{k-1}, w_k) := \tilde{q}_k\).

In the next step, a chain of states \(\tilde{q}_{l+1} \ldots \tilde{q}_t\), \(\tilde{q}_t \in \tilde{F}\), consuming the remainder of \(w\) (i.e., \(w_{l+1} \ldots w_t\)) is appended to \(\tilde{q}_l\) (if it has been created) or to \(q_l\). If \(l = t\), the remainder is the empty string, and we make sure that \(q_l / \tilde{q}_l \in \tilde{F}\).

Formally, this step creates an automaton \(\tilde{A} = (\Sigma, \tilde{Q}, q_0, \tilde{\delta}, \tilde{F})\) such that, for \(i \in \{k, \ldots, t\}\):³

\[ \tilde{Q} = Q \cup \{\tilde{q}_k \ldots \tilde{q}_t\} \]
\[ \tilde{F} = F \cup \{\tilde{q}_i : q_i \in F\} \]
\[ \tilde{\delta}(q, a) = \begin{cases} \tilde{q}_k & : q = q_{k-1},\ a = w_k \\ \tilde{q}_{i+1} & : q = \tilde{q}_i,\ a = w_{i+1} \\ \delta(q, a) & : \text{otherwise.} \end{cases} \]

This completes the insertion step. The new automaton \(\tilde{A}\) obviously accepts the language \(L(A) \cup \{w\}\) and preserves the right languages of all states except \(q_0 \ldots q_{k-1}\).

3.2 Local Minimisation 

The situation after insertion is that \(\tilde{A}\) contains

• a path \(q_0 \ldots q_{k-1}\) of states whose right languages may have changed,

• a path \(\tilde{q}_k \ldots \tilde{q}_t\) of newly created (partly cloned) states, and

• the remaining states of \(A\), whose right languages have not changed (i.e., (1) still holds for \(Q \setminus \{q_0 \ldots q_{k-1}\}\)).

In order to make the new FSA minimal, the algorithm must enforce condition (1) for the states \(q_0 \ldots q_{k-1} \tilde{q}_k \ldots \tilde{q}_t\) by replacing them, if possible, by their equivalents in a set \(Q_{\not\equiv}\), which is initially set to \(Q \setminus \{q_0 \ldots q_{k-1}\}\).

The sequence is traversed in reverse order, starting from \(\tilde{q}_t\). In the \(j\)-th iteration (\(j = 1 \ldots t - 2\)), the algorithm checks if there is already a state \(q' \in Q_{\not\equiv}\) equivalent to the current state \(q\). If such a \(q'\) exists, \(q\) is replaced by \(q'\) (i.e., \(q\) is deleted and all transitions reaching \(q\) are redirected to \(q'\)). Otherwise, \(q\) is added to \(Q_{\not\equiv}\). In this way, the algorithm gets rid of duplicates w.r.t. the equivalence relation \(\equiv\). The automaton left after the last iteration satisfies condition (1), i.e., it is minimal.

2 The following description refers to a simpler variant of the algorithm rather than to the optimised version described in the pseudocode in the original publication. Optimisation is discussed separately in section 6.5.

3 We set \(k := l + 1\) if there are no confluence states.

3.3 The State Register 

The efficiency of the algorithm depends on its ability to quickly check the equivalence of states. This check is fast because \(\delta^*(q, u) \in Q_{\not\equiv}\) for all \(q \in Q_{\not\equiv}\) (at any stage of the local minimisation step). In effect, \(q_1 \equiv q_2 \iff Out(q_1) = Out(q_2) \wedge (q_1, q_2 \in F \vee q_1, q_2 \notin F)\), where \(Out(q) = \{(a, q') : \delta(q, a) = q'\}\) is the set of transitions leaving \(q\). Thus, for each \(q \in Q_{\not\equiv}\), the pair \((Out(q), q)\) is put on a register, i.e., an associative container that maps sets of pairs (input, state) (uniquely identifying a right language) to the corresponding states in \(Q_{\not\equiv}\).
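The register can be sketched as a hash map keyed by \(Out(q)\) together with the finality of \(q\). The code below is only an illustration under simplifying assumptions: states are integers, `Delta` is a nested dictionary, and the helper names `out_signature` and `replace_or_register` are invented for this example. It returns the representative of a state but leaves the redirection of incoming transitions to the caller.

```python
from typing import Dict, Set, Tuple

Delta = Dict[int, Dict[str, int]]   # state -> {input symbol: target state}


def out_signature(q: int, delta: Delta, finals: Set[int]) -> Tuple[bool, frozenset]:
    """Finality plus Out(q); identifies L(q) once all targets of q are registered."""
    return (q in finals, frozenset(delta.get(q, {}).items()))


def replace_or_register(q: int, delta: Delta, finals: Set[int], register: dict) -> int:
    """Return the registered state equivalent to q, registering q if there is none."""
    return register.setdefault(out_signature(q, delta, finals), q)


# States 1 and 2 both lead to the final state 3 via 'c', so they are equivalent.
DELTA = {0: {"a": 1, "b": 2}, 1: {"c": 3}, 2: {"c": 3}, 3: {}}
FINALS = {3}
register: dict = {}
# Process states bottom-up, as in the local minimisation step.
unique = [replace_or_register(q, DELTA, FINALS, register) for q in (3, 1, 2)]
print(unique)  # prints [3, 1, 1]
```

With a hash map, both the lookup and the registration take expected constant time per state, which is what keeps the equivalence check cheap.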

4 Application to Transducers 

The problem for sequential transducers can be stated as follows: given a minimal ST \(T\) implementing a sequential function \(f\), we want to insert into \(T\) a string \(w\) associated with an emission \(o\), creating a minimal ST for \(f \cup \{(w, o)\}\). Daciuk et al. (2000) state on this topic:

This new algorithm can also be used to construct transducers. The alphabet of the (transducing) automaton would be \(\Sigma_1 \times \Sigma_2\), where \(\Sigma_1\) and \(\Sigma_2\) are the alphabets of the levels. Alternatively, as previously described, elements of \(\Sigma_2\) can be associated with the final states of the dictionary and only output once a valid word from \(\Sigma_1^*\) is recognised.



Unfortunately, both suggested solutions are 
problematic. They require that we commit our- 
selves to a particular alignment of input and 
output symbols in the transitions in advance, 
before running the algorithm. For instance, 
consider the fragment of a pronunciation dic- 
tionary shown below. 

but  | b uh t
bite | b ai t
cut  | k uh t
cite | s ai t

Obviously, there are several string-to-string 
transducers that implement this dictionary. 
One possibility would be to encode the mapping 
in a phonologically motivated way, i.e., associat- 
ing each phonetic symbol with the grapheme(s) 
it corresponds to. Unfortunately, the result of applying an FSA minimisation algorithm is non-deterministic (figure 1).



Figure 1: A "phonologically motivated" align- 
ment of input and output symbols. 

The second suggestion made by Daciuk et al. (2000), the use of final emissions, can be emulated using a special end-of-string symbol $. The result of FSA minimisation, shown at the top of figure 2, is an ST, but not minimal, since it has more states than the ST shown below.

5 ST Minimality Criteria 

In order to adapt the original algorithm to transducers, we need to find an ST counterpart of the \(\equiv\) relation defined for finite-state automata. The following proposition constitutes a good point of departure.

Proposition 1. (Mohri, 1994) If \(f : \Sigma^* \to \Delta^*\) is a sequential function, there exists a minimal FST \(T = (\Sigma, \Delta, Q, i, \delta, \sigma, F)\) realising \(f\). The size \(|T|\) (\(= |Q|\)) of \(T\) is equal to the count of the equivalence relation \(R_f\) defined as

\[ u\,R_f\,v \iff L(u) = L(v) \wedge \exists u', v' \in \Delta^*\ \forall w \in L(u) : u'^{-1} f(uw) = v'^{-1} f(vw) \qquad (2) \]




Figure 2: An ST with final emissions and a minimal ST.

\(R_f\) is defined on the set \((\Sigma^*)^2\). In order to adapt the original algorithm, we must define an equivalence relation on \(Q\), analogous to \(\equiv\). It turns out that this is possible for transducers that are prefix-normalised, as stated in the following definition.

Definition 5. A sequential transducer \(T = (\Sigma, \Delta, Q, i, \delta, \sigma, F)\) is prefix-normalised if

\[ \forall q \in Q : \bigwedge_{w \in L(q)} \sigma^*(q, w) = \epsilon \qquad (3) \]

The following proposition allows us to define an equivalence relation on \(Q\) analogous to \(R_f\).

Proposition 2. If a trim sequential transducer \(T = (\Sigma, \Delta, Q, i, \delta, \sigma, F)\) that realises a function \(f : \Sigma^* \to \Delta^*\) is prefix-normalised, then the count of \(R_f\) is equal to the count of the relation \(\equiv_{ST} \subseteq Q^2\) defined as follows:

\[ q \equiv_{ST} q' \iff L(q) = L(q') \wedge \forall w \in L(q) : \sigma^*(q, w) = \sigma^*(q', w) \]

Proof. Since \(T\) is trim, it is sufficient to prove

\[ u\,R_f\,v \iff \delta^*(q_0, u) \equiv_{ST} \delta^*(q_0, v) \]

\(\Leftarrow\): follows immediately with \(u' = \sigma^*(q_0, u)\), \(v' = \sigma^*(q_0, v)\), since \(L(u) = L(\delta^*(q_0, u))\) by definition.

\(\Rightarrow\): Let \(q := \delta^*(q_0, u)\), \(q' := \delta^*(q_0, v)\). If \(u\,R_f\,v\), then \(L(q) = L(q')\) and there exist \(u', v' \in \Delta^*\) such that:

\[ \forall w \in L(q) : u'^{-1} f(uw) = v'^{-1} f(vw) \]

Since \(f(uw) = \sigma^*(q_0, u)\,\sigma^*(q, w)\), this is equivalent to:

\[ \forall w \in L(q) : u'^{-1} \sigma^*(q_0, u)\,\sigma^*(q, w) = v'^{-1} \sigma^*(q_0, v)\,\sigma^*(q', w) \]

Furthermore, \(u'\) and \(v'\) must be prefixes of \(\sigma^*(q_0, u)\) and \(\sigma^*(q_0, v)\), respectively (otherwise, \(T\) would not be prefix-normalised), thus there exist \(u'', v''\) such that \(u'u'' = \sigma^*(q_0, u)\), \(v'v'' = \sigma^*(q_0, v)\) and

\[ \forall w \in L(q) : u'' \sigma^*(q, w) = v'' \sigma^*(q', w) \]

This holds only if \(u''\) is a prefix of \(v''\) or vice versa. Without loss of generality assume \(u'' = v'' z\). Then it follows:

\[ \forall w \in L(q) : z\,\sigma^*(q, w) = \sigma^*(q', w) \]

Therefore, for all \(w \in L(q')\), \(z\) is a prefix of \(\sigma^*(\delta^*(q_0, v), w)\). Since \(T\) is prefix-normalised, this implies \(z = \epsilon\), i.e. \(u'' = v''\), hence \(\forall w \in L(q) : \sigma^*(\delta^*(q_0, u), w) = \sigma^*(\delta^*(q_0, v), w)\), i.e., \(q \equiv_{ST} q'\). □

6 The Algorithm

According to proposition 2, a modification of the original algorithm (by Daciuk et al. (2000)) shall produce a minimal ST if \(\equiv\) is replaced by \(\equiv_{ST}\), and the transducer being constructed is prefix-normalised in each iteration. As in the original approach, each iteration is a two-step operation: first, a new word-output pair \((w, o)\) is inserted into a minimal, trim and prefix-normalised transducer \(T\), creating a prefix-normalised, "almost minimal" transducer \(\tilde{T}\). In the second step, the "redundant" states on the path of \(w\) are merged with equivalent states in \(\tilde{T}\), resulting in a minimal, trim and prefix-normalised ST implementing \(f_T \cup \{(w, o)\}\).

The modifications to the FSA algorithm required in order to adapt it to sequential transducers are discussed in sections 6.1 and 6.2. The pseudocode is presented in section 6.3.



6.1 Insertion 

Like the original algorithm, the modified one identifies \(w \wedge T\) and the corresponding states \(q_0 \ldots q_l\), and creates new states \(\tilde{q}_{l+1} \ldots \tilde{q}_t\). It also clones the path \(q_k \ldots q_l\) from the first confluence state (if there is one).

The main complication is due to the insertion of the output sequence \(o\). In order for the ST to be well-formed, the output \(\tilde{\sigma}^*(q_0, w \wedge T)\) generated by the prefix \(w \wedge T\) must be a prefix of \(o\). However, the original output \(\sigma^*(q_0, w \wedge T)\) in \(T\) might not meet this requirement.

The solution is to "push" some of the outputs away from the path of the prefix \(w \wedge T = w_{[1 \ldots l]}\). However, one must be careful not to change the right languages (and their translations) of the states that are not on the path of \(w\). Furthermore, proposition 2 requires the resulting ST to be prefix-normalised.

All this can be achieved by the following recursive definition of \(\tilde{\sigma}\).⁴

\[ \tilde{\sigma}(r_i, w_{i+1}) := (\tilde{\sigma}^*(q_0, w_{[1 \ldots i]}))^{-1} (o \wedge \sigma^*(q_0, w_{[1 \ldots i+1]})) \]
\[ \tilde{\sigma}(r_i, a) := (\tilde{\sigma}^*(q_0, w_{[1 \ldots i]}))^{-1} \sigma^*(q_0, w_{[1 \ldots i]} a), \quad a \neq w_{i+1} \]

An example of insertion is shown in figure 3.




Figure 3: Insertion of pair (aaaa, axxx). 

As for the suffix \(w_{[l+1 \ldots t]}\), we associate the output remainder \((\tilde{\sigma}^*(q_0, w_{[1 \ldots l]}))^{-1} o\) with the first transition. The remaining ones emit \(\epsilon\).

This insertion mechanism ensures that \(\tilde{T}\) implements the mapping \(f_T \cup \{(w, o)\}\) without changing the equivalence of states not on the path \(q_0 \ldots q_{k-1}\).

6.2 Local Minimisation 

As in the FSA algorithm, the path \(q_0 \ldots q_{k-1} \tilde{q}_k \ldots \tilde{q}_t\) is traversed in reverse order. Each state \(q\) for which there exists an equivalent state \(q' \in Q_{\not\equiv}\) is replaced by \(q'\).

What changes is the way the equivalence of two states \(q_1, q_2\), \(q_2 \in Q_{\not\equiv}\), is determined. Obviously, now it is no longer sufficient to compare the target states and input labels of all transitions leaving \(q_1\) and \(q_2\). However, since \(\tilde{T}\) is prefix-normalised, the only extra bit to check is the output, i.e., we keep the original formula \(q_1 \equiv_{ST} q_2 \iff Out(q_1) = Out(q_2) \wedge (q_1, q_2 \in F \vee q_1, q_2 \notin F)\), but redefine \(Out(q)\) as \(\{(a, o, q') : \delta(q, a) = q' \wedge \sigma(q, a) = o\}\).

4 The symbol \(r_i\) ranges over \(q_i\), \(\tilde{q}_i\) (whenever the latter is defined). If there is no confluence state, \(k := l + 1\).

6.3 Pseudocode 

In each iteration of the main loop, the algorithm takes a pair (Word, Output) and calls the procedures INSERT() and REMOVE_DUPLICATES(), responsible respectively for the insertion and the local minimisation step. The former traverses the word from left to right, clones the path from the first confluence state down (if there is any), and ends up in the state \(\delta^*(q_0, w \wedge T)\), or its copy. From there, INSERT_SUFFIX() is called, creating a chain of states corresponding to the remainder of Word (the pseudocode of INSERT_SUFFIX() is omitted for space reasons, but its functionality is very simple: the remainder of Output is emitted in the first transition of this new chain of states).⁵ In each iteration of the for-loop, the variable Output holds the remainder of the original output that has not been emitted so far. PUSH_OUTPUTS(State, Residual) takes care of making the path outputs in \(\tilde{T}\) compatible with \(o = f_{\tilde{T}}(w)\). After \(i\) iterations of the loop, the argument Residual holds the value of \((o \wedge \sigma^*(q_0, w_{[1 \ldots i]}))^{-1} \sigma^*(q_0, w_{[1 \ldots i]})\), i.e., the remainder of \(\sigma^*(q_0, w_{[1 \ldots i]})\) after subtracting the longest common prefix with \(o\). This residual is prepended to the output labels of all transitions starting in State.⁶

The procedure REMOVE_DUPLICATES() traverses the path of \(w\) in \(\tilde{T}\) (in reverse order) and removes those states for which there is an equivalent state in the register. The remaining states are added to the register (note that the procedure INSERT() de-registers all states from the root down to the first confluence state, if such a state exists, or to the end of \(w \wedge T\)).

5 If \(w \wedge T = w\) and there is some output left, then \(f_T \cup \{(w, o)\}\) is not sequential.

6 Note that if State \(\in F\) and Output \(\neq \epsilon\), \(f_T \cup \{(w, o)\}\) is not sequential.
Algorithm 6.1: ConstructMinST()

Register ← ∅
while there is a word-output pair
    (Word, Output) ← next pair
    INSERT(Word, Output)
    REMOVE_DUPLICATES(Word)

procedure INSERT(Word, Output)
    State ← q0
    FoundConfluence ← false
    for i ← 1 to size(Word)
        if State ∈ Register
            Register ← Register \ {State}
        Symbol ← Word[i]
        Child ← δ(State, Symbol)
        if Child = ∅ break
        if InDegree(Child) > 1
            FoundConfluence ← true
        if FoundConfluence
            δ(State, Symbol) ← CLONE(Child)
        OutputPrefix ← Output ∧ σ(State, Symbol)
        OutputSuffix ← OutputPrefix⁻¹ σ(State, Symbol)
        Output ← OutputPrefix⁻¹ Output
        σ(State, Symbol) ← OutputPrefix
        State ← δ(State, Symbol)
        PUSH_OUTPUTS(State, OutputSuffix)
    rof
    INSERT_SUFFIX(State, Word[i...size(Word)], Output)

procedure REMOVE_DUPLICATES(Word)
    for i ← size(Word) − 1 downto 1
        State ← δ*(q0, Word[1...i])
        Symbol ← Word[i + 1]
        Child ← δ(State, Symbol)
        if ∃q ∈ Register : q ≡ST Child
            δ(State, Symbol) ← q
        else Register ← Register ∪ {Child}
    rof

procedure PUSH_OUTPUTS(State, Residual)
    for each a ∈ Σ
        if δ(State, a) is defined
            σ(State, a) ← Residual · σ(State, a)
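The pseudocode can be turned into a compact executable sketch. The version below is an illustration rather than the implementation evaluated in this paper, and it rests on several assumptions: outputs are Python strings, callers append an end-of-string symbol such as $ to each word themselves, the names `State`, `MinimalST`, `add`, `lookup` and `size` are invented here, every state on the path except the root is considered for merging, and no rollback is attempted when a non-sequential mapping is detected.

```python
class State:
    def __init__(self, final=False, arcs=None):
        self.final = final
        self.arcs = dict(arcs or {})   # input symbol -> (output string, target State)
        self.indeg = 0

    def signature(self):
        # Finality plus the redefined Out(q): (input, output, target) triples.
        return (self.final,
                tuple(sorted((a, o, id(t)) for a, (o, t) in self.arcs.items())))


def lcp(u, v):
    n = 0
    while n < min(len(u), len(v)) and u[n] == v[n]:
        n += 1
    return u[:n]


class MinimalST:
    def __init__(self):
        self.root = State()
        self.register = {}             # signature -> unique representative state

    def _deregister(self, s):
        sig = s.signature()
        if self.register.get(sig) is s:
            del self.register[sig]

    def add(self, word, output):
        state, path, cloning, i = self.root, [self.root], False, 0
        while i < len(word) and word[i] in state.arcs:      # walk the prefix w ∧ T
            a = word[i]
            arc_out, child = state.arcs[a]
            cloning = cloning or child.indeg > 1
            if cloning:                                     # copy confluence state and below
                child.indeg -= 1
                child = State(child.final, child.arcs)
                child.indeg = 1
                for _, t in child.arcs.values():
                    t.indeg += 1
            else:
                self._deregister(child)                     # its signature may change
            prefix = lcp(output, arc_out)
            residual = arc_out[len(prefix):]
            if residual and child.final:
                raise ValueError("mapping is not sequential")
            output = output[len(prefix):]
            state.arcs[a] = (prefix, child)
            # PUSH_OUTPUTS: prepend the residual to every arc leaving the child.
            child.arcs = {b: (residual + o, t) for b, (o, t) in child.arcs.items()}
            state = child
            path.append(state)
            i += 1
        if i == len(word) and output:
            raise ValueError("mapping is not sequential")
        for a in word[i:]:                                  # INSERT_SUFFIX
            new = State()
            new.indeg = 1
            state.arcs[a] = (output, new)                   # remainder on first arc only
            output = ""
            state = new
            path.append(state)
        state.final = True
        for i in range(len(word) - 1, -1, -1):              # REMOVE_DUPLICATES
            parent, child = path[i], path[i + 1]
            twin = self.register.setdefault(child.signature(), child)
            if twin is not child:                           # merge child into its twin
                parent.arcs[word[i]] = (parent.arcs[word[i]][0], twin)
                twin.indeg += 1
                for _, t in child.arcs.values():
                    t.indeg -= 1

    def lookup(self, word):
        state, out = self.root, ""
        for a in word:
            if a not in state.arcs:
                return None
            o, state = state.arcs[a]
            out += o
        return out if state.final else None

    def size(self):
        seen, stack = {id(self.root)}, [self.root]
        while stack:
            for _, t in stack.pop().arcs.values():
                if id(t) not in seen:
                    seen.add(id(t))
                    stack.append(t)
        return len(seen)
```

As a usage example, inserting the pairs ("ab$", "xy"), ("cb$", "zy") and ("ad$", "xw") in this order, or in the reverse order, yields a transducer with five states in both cases, and `lookup` returns the original output for each word. Keying the register on `id()` of the target states is a shortcut that works here because registered states keep their targets alive; a production implementation would more likely use explicit state numbers.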

6.4 Extensions 

The algorithm can be extended to the more general case of subsequential transducers (SST). An SST is an ST that emits final outputs when halting in a final state (Mohri, 1997). It can be emulated in the ST framework by appending a special end-of-string character $ to each string, making each final state \(q\) in the transducer non-final and adding a transition from \(q\) via $ to a new final state \(q_f\). The output associated with this transition is the ST equivalent of a final output. In this encoding, the new algorithm can be used to construct minimal subsequential transducers.

Alternatively, the algorithm can be directly modified to cope with final outputs. In this case, the equivalence relation \(\equiv_{ST}\) needs to be refined by requiring that two equivalent final states must also have identical final outputs.

In some cases, the mapping \(f : w \to o\) is not a function. For example, a word in a pronunciation dictionary may have two or more transcriptions. Such cases of bounded ambiguity can be handled by another extension of the ST framework, namely p-subsequential transducers, in which each final state is associated with up to \(p\) final outputs (Mohri, 1997). The present algorithm can be extended to this case by employing \(p\) different end-of-string symbols \(\$_1 \ldots \$_p\), as in the case of SSTs. This technique was used in the application described in section 7.

6.5 Complexity and Optimisation 

For a dictionary of \(m\) words, the main loop of the algorithm executes \(m\) times. The loops in procedures INSERT() (including the call to INSERT_SUFFIX()) and REMOVE_DUPLICATES() are each executed \(|w|\) times for each word \(w\). Putting a state on a register may be done in constant time when using a hash map.

Compared to the FSA algorithm, the ST generalisation has one more complexity component, namely the procedure PUSH_OUTPUTS(), which is executed in each iteration of the loop in function INSERT(). Each call to PUSH_OUTPUTS(\(q\)) involves OutDegree(\(q\)) operations.

In practical implementations, there is also some overhead stemming from the use of more complex data structures (because of the need to store transition outputs). This mainly affects the efficiency of the register lookup and the INSERT() procedure.

The algorithm can be optimised by reducing the number of times states are registered/de-registered during the processing of the prefix of \(w\) in \(T\) (main loop of INSERT()). More precisely, the idea is to deregister a state only if there is any residual output pushed down the trie (i.e., the previous value of OutputSuffix was other than \(\epsilon\)). As a result, some states \(q_1 \ldots q_s\), \(s < k\), may stay registered after the call to INSERT(). The loop in REMOVE_DUPLICATES() must then check whether or not \(\delta(q_i, w_{i+1}) = q_{i+1}\). If not, \(q_{i+1}\) must have been replaced by an equivalent state. In such a case, we must de-register \(q_i\) and check if there are equivalent states in the register. As soon as one of the \(q_i\)'s is not replaced, there is no need to perform this check for the remaining states \(q_{i-1} \ldots q_1\).

This optimisation idea is used in the original algorithm. As for STs, the speed-up achieved is moderate because PUSH_OUTPUTS() typically changes most of the outputs on the path of \(w\).

7 Applications and Evaluation 

The new algorithm has been employed to construct pronunciation lexica in the rVoice text-to-speech system.⁷ In languages such as English, where the relation between orthography and pronunciation is not straightforward, it is often advantageous to store all known words in the dictionary, rather than rely on letter-to-sound rules (Fackrell and Skut, 2004). The algorithm makes it possible to store large amounts of data in such dictionaries without affecting the efficiency and flexibility of the system: the resulting representations are very compact, words can be looked up deterministically in linear time, and user-defined entries can be inserted into the dictionary at any time in any order (unlike in Maurel and Mihov's approach). This last feature is particularly important as rVoice users can control the behaviour of the system by dynamically inserting their own entries into the dictionary at runtime.

The performance of the algorithm has been evaluated by constructing a minimal ST for a pronunciation dictionary comprising the 50,000 most frequent British surnames. The size and the construction time for the ST were compared to the equivalent parameters for the sorted-data ST algorithm (Mihov and Maurel, 2001) and the unsorted-data FSA algorithm (Daciuk et al., 2000). The dictionary was not sorted, so there was an extra sorting step in the case of Maurel and Mihov's algorithm (sorting took less than 1 sec. and is not included in the reported execution time). In the FSA, the phonetic transcriptions were encoded as final emissions (i.e., the FSA encoded the language \(\{w^{(1)}o^{(1)}, \ldots, w^{(m)}o^{(m)}\}\), each phonetic symbol serving as an additional symbol of the input alphabet). The ST encoding used a special end-of-string symbol $ appended to each word in order to make sure the resulting mapping was rational.⁸ The results are shown in table 1.





            ST-unsorted   ST-sorted   FSA
states      22,211        22,211      161,592
arcs        67,129        67,129      211,327
time        19 sec        12 sec      22 sec

Table 1: Comparison of three construction 
methods (unsorted-data ST, sorted-data ST 
and unsorted-data FSA) applied to a pronunci- 
ation lexicon on a Pentium 4 1.7 GHz processor. 

The comparison demonstrates that STs are superior to FSAs as an encoding method for lexica annotated with rich representations. FSA minimisation is obviously of little help if every (or almost every) input is associated with a different annotation; almost no states are merged in the part of the FSA encoding the \(w^{(j)}\)'s.⁹ Since the FSA is much larger, construction takes longer than in the ST case although the FSA algorithm is faster on structures of equal size.

Not surprisingly, the sorted-data algorithm 
is faster than the unsorted-data version, even 
including the actual sorting time. However, 
its limited flexibility restricts its applicability, 
leaving the new unsorted-data algorithm as the 
preferable option in a range of applications. 